Introduction
This project applies and compares several machine learning methods to predict residential property prices. Using a structured housing dataset, the full data science pipeline is implemented, including data preparation, feature engineering, exploratory analysis, and regression modeling. Multiple models — including tree-based approaches and a neural network — are trained and tuned, and their predictive performance is evaluated on a separate test set using metrics such as RMSE, MAE, and R².
In addition to performance comparison, the project also focuses on model interpretation to understand which features most influence price predictions. The results are summarized in a compact dashboard, providing an overview of model accuracy and key insights. Overall, the project demonstrates how different machine learning techniques can be systematically evaluated in a real-world regression task.
In this data exploration we look at the US real estate market using a dataset published on Kaggle by Ahmed Shahriar Sakib. It contains over 2.2 million real estate listings, broken down by state, size, and price, among other factors. (Source: https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset/data)
Data import
data = read.csv("data/realtor-data.zip.csv") # Data import
Data Transformation
data = subset(data, select = c(status, price, bed, bath, acre_lot, city, state, house_size)) # keep relevant columns
Structure with standard datatypes
Code
structure_tbl <- tibble::tibble(
  Variable = names(data),
  Type = sapply(data, function(x) class(x)[1]),
  Example = sapply(data, function(x) {
    val <- unique(x[!is.na(x)])[1]
    if (is.factor(val)) as.character(val)  # non-factor columns yield NULL here
  }),
  Missing = sapply(data, function(x) sum(is.na(x)))
)
kable(
  structure_tbl,
  caption = "Structure summary of the dataset",
  align = c("l", "l", "l", "r")
)
Structure of cleaned dataset

| Variable   | Type      | Example | Missing |
|------------|-----------|---------|--------:|
| status     | character | NULL    | 0       |
| price      | numeric   | NULL    | 1541    |
| bed        | integer   | NULL    | 481317  |
| bath       | integer   | NULL    | 511771  |
| acre_lot   | numeric   | NULL    | 325589  |
| city       | character | NULL    | 0       |
| state      | character | NULL    | 0       |
| house_size | numeric   | NULL    | 568484  |
Code
# Assign data types
data$status = as.factor(data$status)
data$city = as.factor(data$city)
data$state = as.factor(data$state)
NA Removal
Code
before_rows <- nrow(data)
data <- na.omit(data)
after_rows <- nrow(data)
kable(data.frame(
  Description = c("Before NA removal", "After NA removal"),
  Rows = c(before_rows, after_rows)
))
| Description       | Rows    |
|-------------------|--------:|
| Before NA removal | 2226382 |
| After NA removal  | 1360716 |
The dataset now has 1360716 observations and 8 variables after removing rows with missing values.
Filtering
Code
# Filter min and max values
data = data |> filter(price > 10000 & price < 1000000000)
Calculations
Code
data = data |> mutate(price_per_sqm = price / house_size)
Cleaned Dataset
Code
paged_table(data)
Structure after transformation
Code
structure_tbl <- tibble::tibble(
  Variable = names(data),
  Type = sapply(data, function(x) class(x)[1]),
  Example = sapply(data, function(x) {
    val <- unique(x[!is.na(x)])[1]
    if (is.factor(val)) as.character(val) else as.character(round(val, 2))
  }),
  Missing = sapply(data, function(x) sum(is.na(x)))
)
kable(
  structure_tbl,
  caption = "Structure summary of the dataset",
  align = c("l", "l", "l", "r")
)
Structure of cleaned dataset

| Variable      | Type    | Example     | Missing |
|---------------|---------|-------------|--------:|
| status        | factor  | for_sale    | 0 |
| price         | numeric | 105000      | 0 |
| bed           | integer | 3           | 0 |
| bath          | integer | 2           | 0 |
| acre_lot      | numeric | 0.12        | 0 |
| city          | factor  | Adjuntas    | 0 |
| state         | factor  | Puerto Rico | 0 |
| house_size    | numeric | 920         | 0 |
| price_per_sqm | numeric | 114.13      | 0 |
After cleaning the dataset, all variables have appropriate data types and no missing values (n = 1360076):

- status – Factor variable showing the listing status (e.g., for_sale, sold).
- price – Numeric value for the property’s price in USD.
- bed, bath – Integer counts of bedrooms and bathrooms.
- acre_lot – Numeric size of the lot (in acres).
- city, state – Factor variables identifying the property’s location.
- house_size – Numeric size of the house (in square feet).
- price_per_sqm – Numeric variable derived from price / house_size to compare prices across properties.
All rows with missing data were removed, and categorical variables were converted to factors for easier analysis and visualization later on.
Data dictionary
Code
tibble(
  Variable = c("price", "status", "acre_lot", "state", "house_size", "price_per_sqm"),
  Description = c(
    "The price for which the item was listed on the market",
    "The status if the house is already sold or still for sale",
    "The size of the land / lot on which the house is located in acres",
    "The state in which the house is located",
    "The size of the house in square feet",
    "The price per square footage"
  )
) |>
  kable(
    caption = "Description of key variables in the dataset",
    align = c("l", "l")
  )
Description of key variables in the dataset

| Variable      | Description |
|---------------|-------------|
| price         | The price for which the item was listed on the market |
| status        | The status if the house is already sold or still for sale |
| acre_lot      | The size of the land / lot on which the house is located in acres |
| state         | The state in which the house is located |
| house_size    | The size of the house in square feet |
| price_per_sqm | The price per square footage |
Summary statistic tables
In this section we summarize the cleaned dataset and explore its basic statistical properties.
Top categories for factor variables with counts, proportions, and mean price

| Variable | Category     | Count  | Percent | Mean_Price |
|----------|--------------|-------:|--------:|-----------:|
| city     | Houston      | 19226  | 1.41    | 477651 |
| city     | Tucson       | 7876   | 0.58    | 384816 |
| city     | Phoenix      | 7694   | 0.57    | 543665 |
| city     | Los Angeles  | 7556   | 0.56    | 1885626 |
| city     | Dallas       | 7510   | 0.55    | 587276 |
| city     | Philadelphia | 7336   | 0.54    | 338467 |
| city     | Richmond     | 6592   | 0.48    | 392538 |
| city     | Orlando      | 6281   | 0.46    | 418841 |
| city     | Fort Worth   | 6171   | 0.45    | 389780 |
| city     | Saint Louis  | 5970   | 0.44    | 250029 |
| state    | California   | 170954 | 12.57   | 1095518 |
| state    | Texas        | 145394 | 10.69   | 451253 |
| state    | Florida      | 127675 | 9.39    | 649826 |
| state    | Arizona      | 54488  | 4.01    | 552916 |
| state    | Pennsylvania | 51922  | 3.82    | 343792 |
| state    | New York     | 50935  | 3.75    | 669257 |
| state    | Georgia      | 49234  | 3.62    | 422988 |
| state    | Illinois     | 46901  | 3.45    | 357316 |
| state    | Washington   | 46450  | 3.42    | 728113 |
| state    | Virginia     | 44236  | 3.25    | 547994 |
| status   | for_sale     | 750493 | 55.18   | 621383 |
| status   | sold         | 609583 | 44.82   | 515063 |
Interpretation
city: Most listings in Houston, Tucson, Phoenix, and Los Angeles. Prices range widely — highest in Los Angeles (~$1.9M), lowest around $250k (Saint Louis).
state: California, Texas, and Florida dominate listings (>30% total). California shows the highest mean price (~$1.1M).
status: 55% for sale, 45% sold. Active listings are priced higher (~$621k vs. $515k).
Overall: Listings cluster in major U.S. cities and states, with strong regional price differences, especially high in California and large metro areas.
Visualisation of nominal variables (top categories)
Code
nominal_summary <- nominal_summary |>
  group_by(Variable) |>
  mutate(
    Category = forcats::fct_reorder(Category, Count),
    Category = factor(Category, levels = unique(Category))
  ) |>
  ungroup()

# Plot: one facet per variable, each with its own x and y scale
ggplot(nominal_summary, aes(x = Count, y = Category, fill = Variable)) +
  geom_col(show.legend = FALSE, alpha = 0.8, width = 0.7) +
  facet_wrap(~ Variable, ncol = 1, scales = "free", drop = TRUE) +
  scale_x_continuous(labels = label_comma()) + # thousands separators, no 1e+05
  theme_minimal() +
  labs(
    title = "Top Categories per Factor Variable",
    x = "Count",
    y = "Category"
  ) +
  theme(
    panel.spacing.y = unit(1, "lines"),
    strip.text = element_text(size = 12, face = "bold"),
    axis.text.y = element_text(size = 8),
    plot.margin = ggplot2::margin(5, 15, 5, 5)
  )
There is a very high correlation (0.80) between price and price per square foot (the price_per_sqm variable; house_size is measured in square feet): pricier houses also cost more per square foot, which is plausible because floor area is not the only criterion driving price.

Price also shows moderate correlations with the number of baths (0.58), the house size (0.55), and the number of beds (0.34).

The numbers of beds and baths correlate with the house size.

Interestingly, the price per square foot does not correlate with the house size.

The lot size, also interestingly, does not correlate with any of the other variables.
Price vs. House Size by Status
Code
# Draw a sample for performance
set.seed(123)
sample_data <- data %>% sample_n(50000)

plot_ly(
  sample_data,
  x = ~house_size,
  y = ~price,
  color = ~status,
  type = "scatter",
  mode = "markers",
  alpha = 0.6
) %>%
  plotly::layout(
    title = list(text = "Relationship Between House Size and Price by Status"),
    xaxis = list(title = "House Size (sqft)"),
    yaxis = list(title = "Price ($)", type = "log")
  )
Interpretation
Positive relationship: Larger houses generally have higher prices, though the relationship weakens for very large properties.
Status comparison: Both for_sale and sold homes follow similar trends, but for_sale listings appear higher in price, suggesting sellers may list above sale values.
High variation: At similar sizes, prices vary widely — showing the strong influence of location and other factors.
Outliers: A few extremely large or expensive properties stretch the scale upward.
Average Property Price Map
Code
valid_states <- tibble(
  state_name = c(state.name, "District of Columbia"),
  state_abbr = c(state.abb, "DC")
)
Regional variation: Western and coastal states show generally higher property prices, while central regions are lower.
Highest averages: States like California, New York, and Washington stand out with mean prices well above $1M.
Moderate prices: States such as Texas, Florida, and Arizona fall in the mid-range (~$400–650K).
Lower averages: Midwest and Southern states have more affordable properties on average.
Summary: Property values are heavily influenced by geography — with the highest prices concentrated along the coasts and major urban centers.
Average House Size Map
Code
map_size <- data %>%
  group_by(state) %>%
  summarise(avg_size = mean(house_size, na.rm = TRUE), .groups = "drop") %>%
  inner_join(valid_states, by = c("state" = "state_name"))

plot_ly(
  map_size,
  type = "choropleth",
  locationmode = "USA-states",
  locations = ~state_abbr,
  z = ~avg_size,
  text = ~paste0(state, "<br>Avg Size: ", round(avg_size), " sqft"),
  colorscale = list(c(0, 1), c("lightgreen", "darkgreen")),
  colorbar = list(title = "Avg Size (sqft)")
) %>%
  plotly::layout(
    title = list(text = "Average House Size by U.S. State"),
    geo = list(scope = "usa", projection = list(type = "albers usa"))
  )
Interpretation
General trend: Average house sizes are fairly consistent across most states, typically around 2,000–2,500 sqft.
Larger homes: Some central and mountain states (e.g., Colorado, Utah, Iowa) show slightly larger averages, possibly due to more available land.
Smaller homes: Coastal and densely populated states (e.g., New York, California) tend to have smaller average house sizes.
Price Range by US State
Code
map_extremes <- data |>
  group_by(state) |>
  summarise(
    min_price = suppressWarnings(min(price, na.rm = TRUE)),
    max_price = suppressWarnings(max(price, na.rm = TRUE)),
    .groups = "drop"
  ) |>
  mutate(range_price = max_price - min_price) |>
  inner_join(valid_states, by = c("state" = "state_name"))

plot_ly(
  map_extremes,
  type = "choropleth",
  locationmode = "USA-states",
  locations = ~state_abbr,
  z = ~range_price,
  text = ~paste0(
    state,
    "<br>Min: $", formatC(min_price, big.mark = ",", format = "f", digits = 0),
    "<br>Max: $", formatC(max_price, big.mark = ",", format = "f", digits = 0)
  ),
  colorscale = "Reds",
  colorbar = list(title = "Price Range ($)")
) |>
  plotly::layout(
    title = list(text = "Price Extreme Values by U.S. State (Max − Min)"),
    geo = list(scope = "usa", projection = list(type = "albers usa"))
  )
Interpretation
Highest ranges: California shows by far the largest price range (over $400M), driven by extremely high luxury property values.
Moderate ranges: States like Florida and parts of the Northeast also show wide price spreads, reflecting diverse markets from affordable to luxury homes.
Lower ranges: Most central and midwestern states have smaller price gaps, indicating more uniform housing markets.
The cities are excluded from this regression because the number of distinct levels would be too large for the model; only the state is therefore used to represent location.

US island locations such as Hawaii and the Virgin Islands appear to be the greatest contributors to price, with California third.

Among the numeric variables, the number of baths increases the predicted price the most, at around $384,000 per bathroom. Every square foot of house size adds about $16.70 to the price, and each acre of lot size about $14.20. Interestingly, each additional bed appears to decrease the price by around $81,946. Note that these estimated effects assume that all other variables are held constant.
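The ceteris-paribus reading of these coefficients can be illustrated by comparing predictions for two hypothetical listings that differ in a single variable. This is a sketch only: the feature values are invented, and `model_lr` and `train_data` are the object names assumed from the modeling code.

```r
# Two hypothetical listings, identical except for the number of bathrooms
base <- data.frame(
  house_size = 2000, bath = 2, bed = 3,
  state = factor("Texas", levels = levels(train_data$state)),
  status = factor("for_sale", levels = levels(train_data$status))
)
plus_bath <- transform(base, bath = 3)

# The difference between the two predictions equals the bath coefficient
predict(model_lr, newdata = plus_bath) - predict(model_lr, newdata = base)
```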
| RMSE    | MAE    | R2    |
|--------:|-------:|------:|
| 1018695 | 329149 | 0.204 |
The linear regression model achieves a test RMSE of approximately 1,018,695, which is higher than the Random Forest but lower than the neural network. This indicates that while the model captures some systematic relationship between predictors and price, its linear structure cannot fully model the complexity of the housing market. The result suggests moderate predictive ability but clear underfitting compared to more flexible models.
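These test-set metrics can be produced with a small helper along the following lines (a sketch; `model_lr` and `test_data` are the names assumed from the modeling section, and the metric formulas are the standard definitions):

```r
# Sketch: standard regression metrics on the held-out test set
eval_metrics <- function(model, newdata, truth) {
  pred <- predict(model, newdata = newdata)
  tibble::tibble(
    RMSE = sqrt(mean((truth - pred)^2)),  # root mean squared error
    MAE  = mean(abs(truth - pred)),       # mean absolute error
    R2   = 1 - sum((truth - pred)^2) / sum((truth - mean(truth))^2)
  )
}

eval_metrics(model_lr, test_data, test_data$price)
```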
Interpretation for Hyperparameters of Random Forest
Before interpreting the results, it is useful to briefly clarify the meaning of the two Random Forest hyperparameters:
- num.trees – Specifies the number of decision trees grown in the forest. Increasing num.trees generally reduces variance and stabilizes predictions, but beyond a certain point the performance gains become marginal while computational cost increases.
- mtry – Defines the number of predictor variables randomly selected as candidates at each split. Smaller values increase randomness and tree diversity, which can reduce overfitting, whereas larger values make individual trees more similar and may increase variance.
The grid search results show clear performance differences across combinations of num.trees and mtry:
The best-performing configuration is num.trees = 300 and mtry = 2, achieving the lowest RMSE (838,263.1).
Increasing the number of trees from 300 to 500 with the same mtry = 2 does not improve performance (RMSE slightly increases to 839,089.1), indicating diminishing returns from adding more trees.
Using only 100 trees leads to a noticeably higher RMSE (846,528.8), suggesting that the forest is not yet sufficiently stable.
Higher values of mtry (3 or 4) consistently result in worse RMSE, regardless of the number of trees. This indicates that allowing too many variables at each split reduces tree diversity and increases overfitting.
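A grid search of this kind can be sketched as follows. This is illustrative only: the exact formula, grid values, and evaluation scheme used in the report may differ, and `train_data`/`test_data` are the names assumed from the data split.

```r
# Sketch: tune num.trees and mtry for ranger by held-out RMSE
grid <- expand.grid(num.trees = c(100, 300, 500), mtry = c(2, 3, 4))

grid$rmse <- sapply(seq_len(nrow(grid)), function(i) {
  fit <- ranger::ranger(
    price ~ house_size + bath + bed + acre_lot + state + status,
    data = train_data,
    num.trees = grid$num.trees[i],
    mtry = grid$mtry[i],
    seed = 123
  )
  pred <- predict(fit, data = test_data)$predictions
  sqrt(mean((test_data$price - pred)^2))
})

grid[order(grid$rmse), ]  # best combination first
```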
importance_df <- enframe(rf_model$variable.importance, name = "Variable", value = "Importance") %>%
  arrange(desc(Importance))

ggplot(importance_df, aes(x = reorder(Variable, Importance), y = Importance, fill = Importance)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Variable Importance from Random Forest Model",
    x = "Variable",
    y = "Importance"
  )
| RMSE   | MAE    | R2    |
|-------:|-------:|------:|
| 847439 | 224332 | 0.449 |
The Random Forest shows the best performance with a test RMSE of about 847,439, a substantial improvement over both linear regression and the neural network. This reduction in error indicates that modeling non-linearities and interactions between variables significantly improves predictions. The model therefore captures the structure of the data more effectively than the other approaches.
Before interpreting the results, it is useful to briefly clarify the meaning of the two neural network hyperparameters:
- size – Specifies the number of neurons in the hidden layer. A larger size increases the model’s capacity to learn complex, non-linear relationships, but also raises the risk of overfitting and increases computational complexity.
- decay – Controls the strength of weight decay (L2 regularization). Higher values of decay penalize large weights more strongly, which can improve generalization by reducing overfitting, while a value of zero corresponds to no regularization.
The grid search results indicate limited sensitivity of model performance to most combinations of size and decay, with one notable exception. The best-performing configuration is size = 7 and decay = 0.10, achieving the lowest RMSE (1,100,793). This suggests that a larger hidden layer combined with relatively strong regularization allows the neural network to capture more complex relationships in the data while mitigating overfitting.
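The tuning loop can be sketched as below. This is illustrative: `nnet` typically needs scaled inputs and `linout = TRUE` for regression, the grid values are assumptions, and `train_data`/`test_data` are the names assumed from the data split.

```r
# Sketch: tune hidden-layer size and weight decay for a single-hidden-layer network
grid <- expand.grid(size = c(3, 5, 7), decay = c(0, 0.01, 0.10))

results <- lapply(seq_len(nrow(grid)), function(i) {
  fit <- nnet::nnet(
    price ~ house_size + bath + bed + state + status,
    data = train_data,
    size = grid$size[i],
    decay = grid$decay[i],
    linout = TRUE,   # linear output unit for regression
    maxit = 200,
    trace = FALSE
  )
  pred <- as.numeric(predict(fit, newdata = test_data))
  data.frame(size = grid$size[i], decay = grid$decay[i],
             rmse = sqrt(mean((test_data$price - pred)^2)))
})

do.call(rbind, results)
```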
| model | rmse    | mae    | r2         |
|-------|--------:|-------:|-----------:|
| nnet  | 1141878 | 389883 | -0.0000273 |
The neural network model was trained with one hidden layer and tuned using different combinations of hidden units and regularization (decay). The selected configuration balances model flexibility and overfitting control, allowing the network to approximate complex functional relationships. However, despite this flexibility, the neural network shows weaker performance on the test set compared to the Random Forest and even the linear regression model. Neural networks are sensitive to preprocessing choices such as scaling and encoding, and their performance on structured tabular data is not always superior to tree-based methods. While the model demonstrates the capability to learn non-linear relationships, in this application it does not generalize as effectively as the Random Forest.
| variable   | rmse_increase |
|------------|--------------:|
| house_size | 0.000654      |
| bed        | 0.000654      |
| state      | 0.000115      |
| status     | -0.00000441   |
| acre_lot   | -0.0000238    |
| bath       | -0.000404     |
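Permutation importances like those above can be computed by shuffling one predictor at a time and recording the change in test RMSE. This is a sketch, not the report's exact code; the small magnitudes in the table suggest the report computed them on a scaled target.

```r
# Sketch: permutation importance via increase in test RMSE
perm_importance <- function(model, newdata, truth, vars) {
  base_rmse <- sqrt(mean((truth - as.numeric(predict(model, newdata)))^2))
  out <- lapply(vars, function(v) {
    shuffled <- newdata
    shuffled[[v]] <- sample(shuffled[[v]])  # break the variable's link to the target
    rmse <- sqrt(mean((truth - as.numeric(predict(model, shuffled)))^2))
    data.frame(variable = v, rmse_increase = rmse - base_rmse)
  })
  res <- do.call(rbind, out)
  res[order(-res$rmse_increase), ]
}
```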
The best tuned neural network configuration (size = 7, decay = 0.10) reaches a grid-search RMSE of roughly 1,100,793 and a test RMSE of about 1,141,878, the highest error among the three models. Despite its theoretical flexibility, the neural network does not generalize as well on this tabular dataset. This suggests that, in this case, increased model complexity does not automatically translate into better performance.
Comparison of the Models
In terms of test RMSE, the ranking is: Random Forest (≈ 847k) < Linear Regression (≈ 1,019k) < Neural Network (≈ 1,142k). The Random Forest reduces prediction error by roughly 17% compared to linear regression and by about 26% compared to the neural network, showing a clear performance advantage.
Conclusion
The results show that model flexibility matters, but the type of flexibility is crucial. Tree-based ensemble methods like Random Forest handle structured, interaction-heavy data best, leading to the lowest prediction error. Linear regression serves as a reasonable baseline, while the neural network does not provide benefits in this setting.
Source Code
---
title: "Machine Learning Models Gabriel Hamulic"
subtitle: "Dataset: US Real Estate Dataset"
author: "Hamulic, Gabriel"
date: today
embed-resources: true
format:
  html:
    output-file: US-Real-Estate_Gabriel_Hamulic.html
    #output-ext: "html.html"
    toc: true
    toc-location: right
    code-link: true
    code-tools: true
    #df-print: kable
    theme:
      light: flatly
      dark: darkly
    #echo: fenced
  pdf:
    output-file: US-Real-Estate_Gabriel_Hamulic.pdf
    toc: true
    number-sections: true
    code-link: true
    df-print: tibble
crossref:
  lof-title: "List of Figures"
fig-align: center
execute:
  warning: false
---

\listoffigures
\listoftables
\listoflistings

{{< pagebreak >}}

# Introduction

This project applies and compares several machine learning methods to predict residential property prices. Using a structured housing dataset, the full data science pipeline is implemented, including data preparation, feature engineering, exploratory analysis, and regression modeling. Multiple models — including tree-based approaches and a neural network — are trained and tuned, and their predictive performance is evaluated on a separate test set using metrics such as RMSE, MAE, and R².

In addition to performance comparison, the project also focuses on model interpretation to understand which features most influence price predictions. The results are summarized in a compact dashboard, providing an overview of model accuracy and key insights. Overall, the project demonstrates how different machine learning techniques can be systematically evaluated in a real-world regression task.

## Libraries

```{r}
#| code-summary: Libraries
#| code-fold: true
library <- function(...) {
  suppressPackageStartupMessages(base::library(...))
}
library(ranger)
library(tidyverse)
library(dplyr)
library(knitr)
library(tidyr)
library(rmarkdown)
library(janitor)
library(scales)
library(tidytext)
library(ggforce)
library(GGally)
library(DT)
library(kableExtra)
library(broom)
library(plotly)
library(nnet)
```

# Data

## Data source

In this data exploration we look at the US real estate market using a dataset published on Kaggle by Ahmed Shahriar Sakib. It contains over 2.2 million real estate listings, broken down by state, size, and price, among other factors. (Source: <https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset/data>)

## Data import

```{r setup, include=FALSE}
options(dplyr.print_max = 15, dplyr.print_min = 10)
```

```{r}
data = read.csv("data/realtor-data.zip.csv") # Data import
```

## Data Transformation

```{r}
data = subset(data, select = c(status, price, bed, bath, acre_lot, city, state, house_size)) # keep relevant columns
```

### Structure with standard datatypes

```{r}
#| label: "data_structure"
#| tbl-cap: "Structure of cleaned dataset"
#| code-fold: true
structure_tbl <- tibble::tibble(
  Variable = names(data),
  Type = sapply(data, function(x) class(x)[1]),
  Example = sapply(data, function(x) {
    val <- unique(x[!is.na(x)])[1]
    if (is.factor(val)) as.character(val)
  }),
  Missing = sapply(data, function(x) sum(is.na(x)))
)
kable(
  structure_tbl,
  caption = "Structure summary of the dataset",
  align = c("l", "l", "l", "r")
)
```

```{r}
#| code-fold: true
# Assign data types
data$status = as.factor(data$status)
data$city = as.factor(data$city)
data$state = as.factor(data$state)
```

### NA Removal

```{r}
#| code-fold: true
before_rows <- nrow(data)
data <- na.omit(data)
after_rows <- nrow(data)
kable(data.frame(
  Description = c("Before NA removal", "After NA removal"),
  Rows = c(before_rows, after_rows)
))
```

The dataset now has `r nrow(data)` observations and `r ncol(data)` variables after removing rows with missing values.

### Filtering

```{r}
#| code-fold: true
# Filter min and max values
data = data |> filter(price > 10000 & price < 1000000000)
```

### Calculations

```{r}
#| code-fold: true
data = data |> mutate(price_per_sqm = price / house_size)
```

### Cleaned Dataset

```{r}
#| code-fold: true
paged_table(data)
```

### Structure after transformation

```{r}
#| label: "data_structure_after_transformation"
#| tbl-cap: "Structure of cleaned dataset"
#| code-fold: true
structure_tbl <- tibble::tibble(
  Variable = names(data),
  Type = sapply(data, function(x) class(x)[1]),
  Example = sapply(data, function(x) {
    val <- unique(x[!is.na(x)])[1]
    if (is.factor(val)) as.character(val) else as.character(round(val, 2))
  }),
  Missing = sapply(data, function(x) sum(is.na(x)))
)
kable(
  structure_tbl,
  caption = "Structure summary of the dataset",
  align = c("l", "l", "l", "r")
)
```

After cleaning the dataset, all variables have appropriate data types and no missing values (**n = `r nrow(data)`**):

- **status** – Factor variable showing the listing status (e.g., for_sale, sold).
- **price** – Numeric value for the property’s price in USD.
- **bed, bath** – Integer counts of bedrooms and bathrooms.
- **acre_lot** – Numeric size of the lot (in acres).
- **city, state** – Factor variables identifying the property’s location.
- **house_size** – Numeric size of the house (in square feet).
- **price_per_sqm** – Numeric variable derived from price / house_size to compare prices across properties.

All rows with missing data were removed, and categorical variables were converted to factors for easier analysis and visualization later on.

## Data dictionary

```{r}
#| code-fold: true
tibble(
  Variable = c("price", "status", "acre_lot", "state", "house_size", "price_per_sqm"),
  Description = c(
    "The price for which the item was listed on the market",
    "The status if the house is already sold or still for sale",
    "The size of the land / lot on which the house is located in acres",
    "The state in which the house is located",
    "The size of the house in square feet",
    "The price per square footage"
  )
) |>
  kable(
    caption = "Description of key variables in the dataset",
    align = c("l", "l")
  )
```

# Summary statistic tables

In this section we will cover the summary of our cleaned dataset. We will explore basic statistical values from our data.

## Numeric Statistics

### Summary of numerical values

```{r}
#| label: "Numeric Statistics"
#| tbl-cap: "Summary statistics of numerical variables in dataframe"
#| code-fold: true
data |>
  janitor::clean_names() |>
  mutate(row = row_number() |> factor()) |>
  pivot_longer(cols = where(is.numeric)) |>
  group_by(name) |>
  summarize(
    N = n(),
    min = min(value),
    mean = mean(value),
    median = median(value),
    max = max(value),
    st.dev = sd(value)
  ) |>
  knitr::kable(digits = 2)
```

#### Interpretation

**price**: Very wide range (\$10.4k–\$51.5M). Mean (\$573k), median (\$379k), indicating strong right-skew and high-priced outliers.

**house_size**: Average \~2,119 sqft, median 1,812 sqft. Extremely large max (1,560,780 sqft) signals outliers. The distribution is right-skewed.

**acre_lot**: Median 0.21 acres vs. mean 12.75 acres → a few very large parcels inflate the mean.

**bed / bath**: Typical homes (\~3 beds, 2 baths) with modest spread; minima at 1 suggest realistic counts.

**price_per_sqm**: Mean \$262.42 vs. median \$197.42, also right-skewed, consistent with price outliers.

### Visualisation of numerical values

```{r}
#| label: "Logarithmic Visualisation"
#| tbl-cap: "Visualisation of numerical variables in dataframe"
#| code-fold: true
data |>
  clean_names() |>
  pivot_longer(cols = where(is.numeric)) |>
  ggplot(aes(x = value, fill = name)) +
  geom_histogram(bins = 30, alpha = 0.7, color = "white") +
  scale_x_log10(labels = label_comma()) + # actual values on a log scale
  facet_wrap(~ name, scales = "free_x") +
  theme_minimal() +
  labs(
    title = "Distribution of Numerical Variables (logarithmic scale)",
    x = "Value",
    y = "Count"
  ) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 25, hjust = 1)
  )
```

## Nominal Statistics

### Summary of nominal variables (top categories)

```{r}
#| label: "Nominal Statistics"
#| tbl-cap: "Top categories for factor variables with counts, proportions, and mean price"
#| code-fold: true
top_n_per_var <- 10
nominal_summary <- data |>
  clean_names() |>
  select(where(is.factor), price) |>
  pivot_longer(
    cols = where(is.factor),
    names_to = "Variable",
    values_to = "Category"
  ) |>
  group_by(Variable, Category) |>
  summarise(
    Count = n(),
    Percent = round(100 * Count / nrow(data), 2),
    Mean_Price = round(mean(price, na.rm = TRUE), 0),
    .groups = "drop"
  ) |>
  group_by(Variable) |>
  slice_max(order_by = Count, n = top_n_per_var, with_ties = FALSE) |>
  ungroup()

kable(
  nominal_summary,
  caption = paste0(
    "Top ", top_n_per_var,
    " categories per factor variable (counts, share %, and mean price)"
  ),
  digits = 2,
  align = c("l", "l", "r", "r", "r")
)
```

#### Interpretation

**city**: Most listings in Houston, Tucson, Phoenix, and Los Angeles. Prices range widely — highest in Los Angeles (\~\$1.9M), lowest around \$250k (Saint Louis).

**state**: California, Texas, and Florida dominate listings (\>30% total). California shows the highest mean price (\~\$1.1M).

**status**: 55% for sale, 45% sold. Active listings are priced higher (\~\$621k vs. \$515k).

**Overall**: Listings cluster in major U.S. cities and states, with strong regional price differences, especially high in California and large metro areas.

### Visualisation of nominal variables (top categories)

```{r}
#| label: "Visualization Nominal Statistics"
#| tbl-cap: "Top categories for factor variables with counts, proportions, and mean price"
#| code-fold: true
nominal_summary <- nominal_summary |>
  group_by(Variable) |>
  mutate(
    Category = forcats::fct_reorder(Category, Count),
    Category = factor(Category, levels = unique(Category))
  ) |>
  ungroup()

# Plot: one facet per variable, each with its own x and y scale
ggplot(nominal_summary, aes(x = Count, y = Category, fill = Variable)) +
  geom_col(show.legend = FALSE, alpha = 0.8, width = 0.7) +
  facet_wrap(~ Variable, ncol = 1, scales = "free", drop = TRUE) +
  scale_x_continuous(labels = label_comma()) + # thousands separators, no 1e+05
  theme_minimal() +
  labs(
    title = "Top Categories per Factor Variable",
    x = "Count",
    y = "Category"
  ) +
  theme(
    panel.spacing.y = unit(1, "lines"),
    strip.text = element_text(size = 12, face = "bold"),
    axis.text.y = element_text(size = 8),
    plot.margin = ggplot2::margin(5, 15, 5, 5)
  )
```

# Bivariate Analysis

### Pairs Plot (all numeric variables)

```{r}
#| label: "Pairs Plot"
#| code-fold: true
set.seed(123)
data_num <- data |>
  janitor::clean_names() |>
  select(where(is.numeric)) |>
  slice_sample(n = 3000) |>
  mutate(across(everything(), log1p))

p <- ggpairs(
  data_num,
  progress = FALSE,
  upper = list(continuous = wrap("cor", size = 4, alignPercent = 0.8, stars = TRUE)),
  lower = list(continuous = wrap("points", alpha = 0.3, size = 0.7)),
  diag = list(continuous = wrap("densityDiag", alpha = 0.7))
)

p +
  theme_minimal(base_size = 11) +
  theme(
    strip.text = element_text(size = 8, face = "bold"),
    panel.grid = element_blank(),
    axis.text = element_text(size = 8),
    axis.title = element_text(size = 9),
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5)
  ) +
  labs(title = "Pairs Plot (log-transformed, n = 3000)")
```

#### Interpretation

There is a very high correlation (0.80) between price and price per square foot: pricier houses also cost more per square foot, which is plausible because floor area is not the only criterion driving price. Price also shows moderate correlations with the number of beds (0.34), baths (0.58), and the house size (0.549). The number of beds and baths correlates with the house size. Interestingly, the price per square foot does not correlate with the house size, and the lot size does not correlate with any of the other variables.

### Price vs. House Size by Status

```{r}
#| label: "scatter_price_size"
#| tbl-cap: "Price vs. house size by listing status"
#| code-fold: true
# Draw a sample for performance
set.seed(123)
sample_data <- data %>% sample_n(50000)

plot_ly(
  sample_data,
  x = ~house_size,
  y = ~price,
  color = ~status,
  type = "scatter",
  mode = "markers",
  alpha = 0.6
) %>%
  plotly::layout(
    title = list(text = "Relationship Between House Size and Price by Status"),
    xaxis = list(title = "House Size (sqft)"),
    yaxis = list(title = "Price ($)", type = "log")
  )
```

#### Interpretation

**Positive relationship**: Larger houses generally have **higher prices**, though the relationship weakens for very large properties.

**Status comparison**: Both for_sale and sold homes follow **similar trends**, but **for_sale** listings appear higher in price, suggesting sellers may list above sale values.

**High variation**: At similar sizes, prices vary widely — showing the strong influence of **location** and other factors.

**Outliers**: A few extremely large or expensive properties stretch the scale upward.

### Average Property Price Map

```{r}
#| label: "state_mapping"
#| code-fold: true
valid_states <- tibble(
  state_name = c(state.name, "District of Columbia"),
  state_abbr = c(state.abb, "DC")
)
```

```{r}
#| label: "map_avg_price"
#| tbl-cap: "Average property price by U.S. state"
#| code-fold: true
map_price <- data |>
  group_by(state) |>
  summarise(avg_price = mean(price, na.rm = TRUE), .groups = "drop") |>
  inner_join(valid_states, by = c("state" = "state_name")) |>
  mutate(avg_price_k = avg_price / 1000)

plot_ly(
  map_price,
  type = "choropleth",
  locationmode = "USA-states",
  locations = ~state_abbr,
  z = ~avg_price_k,
  text = ~paste0(state, "<br>Avg Price: $", round(avg_price_k, 1), "K"),
  colorscale = list(c(0, 1), c("lightblue", "darkblue")),
  colorbar = list(title = "Avg Price ($K)")
) |>
  plotly::layout(
    title = list(text = "Average Property Price by U.S. State"),
    geo = list(scope = "usa", projection = list(type = "albers usa"))
  )
```

#### Interpretation

**Regional variation**: Western and coastal states show generally **higher property prices**, while central regions are lower.

**Highest averages**: States like **California, New York, and Washington** stand out with mean prices well above **\$1M**.

**Moderate prices**: States such as Texas, Florida, and Arizona fall in the **mid-range** (\~\$400–650K).

**Lower averages**: Midwest and Southern states have more affordable properties on average.

**Summary**: Property values are heavily influenced by **geography** — with the highest prices concentrated along the coasts and major urban centers.

### Average House Size Map

```{r}
#| label: "map_avg_size"
#| tbl-cap: "Average house size by U.S. state"
#| code-fold: true
map_size <- data %>%
  group_by(state) %>%
  summarise(avg_size = mean(house_size, na.rm = TRUE), .groups = "drop") %>%
  inner_join(valid_states, by = c("state" = "state_name"))

plot_ly(
  map_size,
  type = "choropleth",
  locationmode = "USA-states",
  locations = ~state_abbr,
  z = ~avg_size,
  text = ~paste0(state, "<br>Avg Size: ", round(avg_size), " sqft"),
  colorscale = list(c(0, 1), c("lightgreen", "darkgreen")),
  colorbar = list(title = "Avg Size (sqft)")
) %>%
  plotly::layout(
    title = list(text = "Average House Size by U.S. State"),
    geo = list(scope = "usa", projection = list(type = "albers usa"))
  )
```

#### Interpretation

**General trend**: Average house sizes are fairly consistent across most states, typically around 2,000–2,500 sqft.

**Larger homes**: Some central and mountain states (e.g., Colorado, Utah, Iowa) show slightly larger averages, possibly due to more available land.

**Smaller homes**: Coastal and densely populated states (e.g., New York, California) tend to have smaller average house sizes.

### Price Range by US State

```{r}
#| label: "map_extreme_price"
#| tbl-cap: "Price range (max − min) by U.S. state"
#| code-fold: true
map_extremes <- data |>
  group_by(state) |>
  summarise(
    min_price = suppressWarnings(min(price, na.rm = TRUE)),
    max_price = suppressWarnings(max(price, na.rm = TRUE)),
    .groups = "drop"
  ) |>
  mutate(range_price = max_price - min_price) |>
  inner_join(valid_states, by = c("state" = "state_name"))

plot_ly(
  map_extremes,
  type = "choropleth",
  locationmode = "USA-states",
  locations = ~state_abbr,
  z = ~range_price,
  text = ~paste0(
    state,
    "<br>Min: $", formatC(min_price, big.mark = ",", format = "f", digits = 0),
    "<br>Max: $", formatC(max_price, big.mark = ",", format = "f", digits = 0)
  ),
  colorscale = "Reds",
  colorbar = list(title = "Price Range ($)")
) |>
  plotly::layout(
    title = list(text = "Price Extreme Values by U.S. State (Max − Min)"),
    geo = list(scope = "usa", projection = list(type = "albers usa"))
  )
```

#### Interpretation

**Highest ranges**: California shows by far the **largest price range** (over \$400M), driven by extremely high luxury property values.

**Moderate ranges**: States like Florida and parts of the Northeast also show wide **price spreads**, reflecting diverse markets from affordable to luxury homes.

**Lower ranges**: Most central and midwestern states have smaller price gaps, indicating more uniform housing markets.

# Data Engineering

Split the data into training data and test data

```{r}
#| code-fold: true
#| warning: false
#| message: false
set.seed(123)
train_indices <- sample(1:nrow(data), size = 0.7 * nrow(data))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
```

### Linear Regression

```{r}
#| code-fold: true
model_lr <- lm(price ~ house_size + bath + bed + state + status, data = train_data)
tidy(model_lr) |>
  arrange(p.value) |>
  mutate(
    estimate = round(estimate, 1),
    std.error = round(std.error, 1),
    statistic = round(statistic, 1),
    p.value = signif(p.value, 3)
  ) |>
  datatable(
    caption = "Regression results (interactive)",
    filter = "top",
    options = list(
      pageLength = 10,
      autoWidth = TRUE,
      responsive = TRUE
    )
  )
```

```{r}
#| label: "Graphic Regression"
#| code-fold: true
tidy(model_lr, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(term = reorder(term, estimate)) |>
  ggplot(aes(x = estimate, y = term, fill = estimate > 0)) +
  geom_col(show.legend = FALSE) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(
    title = "Greatest contributors to price",
    x = "regression coefficient",
    y = ""
  )
```

The cities are exempt from this regression as the number of distinct levels would be too large for the model.
Therefore only the state is used to represent location. Hawaii and the U.S. Virgin Islands appear to be the greatest contributors to price, with California third.

Regarding the numeric variables, the number of baths increases the predicted price the most, by around \$384,000 per bathroom. Every square foot of house size adds \$16.7 to the price, and each acre of lot area around \$14.2. Interestingly, each additional bed appears to decrease the price by around \$81,946. Note that each of these effects assumes that all other variables are held constant.

### Interpretation of Linear Regression

```{r}
#| code-fold: true
# Predictions on the test data
lm_preds <- predict(model_lr, newdata = test_data)

# Compute the performance metrics
lm_metrics <- tibble(
  RMSE = sqrt(mean((test_data$price - lm_preds)^2)),
  MAE = mean(abs(test_data$price - lm_preds)),
  R2 = 1 - sum((test_data$price - lm_preds)^2) /
    sum((test_data$price - mean(test_data$price))^2)
)

knitr::kable(lm_metrics)
```

The linear regression model achieves a test RMSE of approximately 1,018,695, which is higher than the Random Forest but lower than the neural network. This indicates that while the model captures some systematic relationship between predictors and price, its linear structure cannot fully represent the complexity of the housing market. The result suggests moderate predictive ability but clear underfitting compared to the more flexible models.

### Hyperparameters for Random Forest

```{verbatim}
set.seed(123)

# simple grid search
rf_grid <- expand.grid(
  num.trees = c(100, 300, 500),
  mtry = c(2, 3, 4)
)

rf_results <- rf_grid |>
  mutate(
    rmse = map2_dbl(num.trees, mtry, ~ {
      model <- ranger(
        price ~ house_size + bed + bath + acre_lot + state + status,
        data = train_data,
        num.trees = .x,
        mtry = .y,
        importance = "impurity"
      )
      preds <- predict(model, data = test_data)$predictions
      sqrt(mean((test_data$price - preds)^2))
    })
  ) |>
  arrange(rmse)

rf_results
```

### Interpretation of the Random Forest Hyperparameters

Before interpreting the results, it is useful to briefly clarify the meaning of the two Random Forest hyperparameters:

**num.trees**: the number of decision trees grown in the forest. Increasing `num.trees` generally reduces variance and stabilizes predictions, but beyond a certain point the performance gains become marginal while the computational cost keeps growing.

**mtry**: the number of predictor variables randomly selected as split candidates at each node. Smaller values increase randomness and tree diversity, which can reduce overfitting, whereas larger values make the individual trees more similar and may increase variance.
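All grid searches and model evaluations in this report rank configurations by the same test-set metrics. As a side note, a small helper (a sketch; the function name `eval_metrics` is our own and is not used elsewhere in the pipeline) makes the RMSE, MAE, and R² formulas explicit in one place:

```{r}
#| code-fold: true
# Sketch of the evaluation metrics used for every model in this report.
# `eval_metrics` is our own illustrative name; the formulas match those
# applied to the test set in the model sections.
eval_metrics <- function(actual, predicted) {
  resid <- actual - predicted
  ss_res <- sum(resid^2, na.rm = TRUE)
  ss_tot <- sum((actual - mean(actual, na.rm = TRUE))^2, na.rm = TRUE)
  tibble::tibble(
    RMSE = sqrt(mean(resid^2, na.rm = TRUE)),
    MAE  = mean(abs(resid), na.rm = TRUE),
    R2   = 1 - ss_res / ss_tot
  )
}
```

For example, `eval_metrics(test_data$price, preds)` returns a one-row tibble that can be printed with `knitr::kable()`.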
| num.trees | mtry |      RMSE |
|----------:|-----:|----------:|
|       300 |    2 | 838,263.1 |
|       500 |    2 | 839,089.1 |
|       100 |    2 | 846,528.8 |
|       500 |    3 | 860,607.9 |
|       100 |    3 | 863,708.7 |
|       300 |    3 | 867,879.5 |
|       100 |    4 | 889,648.8 |
|       300 |    4 | 896,180.3 |
|       500 |    4 | 902,455.2 |

The grid search results show clear performance differences across combinations of `num.trees` and `mtry`:

- The best-performing configuration is **num.trees = 300** and **mtry = 2**, achieving the lowest **RMSE (838,263.1)**.
- Increasing the number of trees from 300 to 500 at mtry = 2 does not improve performance (the RMSE rises slightly to 839,089.1), indicating diminishing returns from adding more trees.
- Using only 100 trees leads to a noticeably higher RMSE (846,528.8), suggesting that the forest is not yet sufficiently stable.
- Higher values of mtry (3 or 4) consistently result in worse RMSE, regardless of the number of trees. This indicates that considering too many variables at each split reduces tree diversity and increases overfitting.

## Random Forest

```{r}
#| code-fold: true
#| warning: false
#| message: false
set.seed(123)
rf_model <- ranger(
  formula = price ~ house_size + bed + bath + acre_lot + state + status,
  data = train_data,
  importance = "impurity",
  num.trees = 300,
  mtry = 2
)

importance_df <- enframe(rf_model$variable.importance,
                         name = "Variable", value = "Importance") %>%
  arrange(desc(Importance))

ggplot(importance_df,
       aes(x = reorder(Variable, Importance), y = Importance, fill = Importance)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Variable Importance from Random Forest Model",
    x = "Variable",
    y = "Importance"
  )
```

### Interpretation of Random Forest Model

```{r}
#| code-fold: true
# Predictions on the test data
rf_preds <- predict(rf_model, data = test_data)$predictions

# Compute the performance metrics
rf_metrics <- tibble(
  RMSE = sqrt(mean((test_data$price - rf_preds)^2)),
  MAE = mean(abs(test_data$price - rf_preds)),
  R2 = 1 - sum((test_data$price - rf_preds)^2) /
    sum((test_data$price - mean(test_data$price))^2)
)

knitr::kable(rf_metrics)
```

The Random Forest shows the best performance with a test RMSE of about 842,861, a substantial improvement over both the linear regression and the neural network. This reduction in error indicates that modeling non-linearities and interactions between variables significantly improves the predictions. The model therefore captures the structure of the data more effectively than the other approaches.

### Hyperparameters for Neural Network

```{verbatim}
nn_grid <- expand.grid(
  size = c(3, 5, 7),   # number of hidden neurons
  decay = c(0, 0.01, 0.1)
)

nn_results <- nn_grid |>
  mutate(
    rmse = map2_dbl(size, decay, ~ {
      model <- nnet(
        price ~ house_size + bed + bath + acre_lot + state + status,
        data = train_data,
        size = .x,
        decay = .y,
        linout = TRUE,
        maxit = 500,
        trace = FALSE
      )
      preds <- predict(model, test_data)
      sqrt(mean((test_data$price - preds)^2))
    })
  ) |>
  arrange(rmse)

nn_results
```

### Interpretation of the Neural Network Hyperparameters

Before interpreting the results, it is useful to briefly clarify the meaning of the two neural network hyperparameters:

**size**: the number of neurons in the hidden layer. A larger size increases the model's capacity to learn complex, non-linear relationships, but also raises the risk of overfitting and increases the computational cost.

**decay**: the strength of weight decay (L2 regularization). Higher values of decay penalize large weights more strongly, which can improve generalization by reducing overfitting, while a value of zero corresponds to no regularization.
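One caveat worth keeping in mind when reading the results below: `nnet` trains by gradient-based optimization and is sensitive to the scale of its inputs, and the raw dollar prices and square footages here span several orders of magnitude. A minimal sketch of train-based standardization (the helper `standardize_by_train` is our own illustration and is not part of the tuned pipeline above):

```{r}
#| code-fold: true
# Illustrative sketch, not part of the tuned pipeline: standardize numeric
# columns using the *training* means and SDs, and apply the same shift and
# scale to the test set, so no test information leaks into preprocessing.
standardize_by_train <- function(train, test, vars) {
  for (v in vars) {
    mu  <- mean(train[[v]], na.rm = TRUE)
    sdv <- sd(train[[v]], na.rm = TRUE)
    train[[v]] <- (train[[v]] - mu) / sdv
    test[[v]]  <- (test[[v]] - mu) / sdv
  }
  list(train = train, test = test)
}
```

Rescaling the predictors (and the response) this way before fitting, then back-transforming the predictions with the training mean and SD, typically lets the network move away from near-constant predictions.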
| size | decay |      RMSE |
|-----:|------:|----------:|
|    7 |  0.10 | 1,100,793 |
|    7 |  0.00 | 1,141,877 |
|    7 |  0.01 | 1,141,878 |
|    3 |  0.00 | 1,141,878 |
|    5 |  0.01 | 1,141,878 |
|    3 |  0.10 | 1,141,878 |
|    3 |  0.01 | 1,141,878 |
|    5 |  0.00 | 1,141,878 |
|    5 |  0.10 | 1,141,878 |

The grid search results indicate limited sensitivity of model performance to most combinations of size and decay, with one notable exception. The best-performing configuration is **size = 7** and **decay = 0.10**, achieving the **lowest RMSE (1,100,793)**. This suggests that a larger hidden layer combined with relatively strong regularization allows the neural network to capture more complex relationships in the data while mitigating overfitting. All other configurations converge to virtually identical RMSE values (≈ 1,141,878), which is consistent with networks that collapse to a near-constant prediction.

## Neural Network

```{r}
#| code-fold: true
#| warning: false
#| message: false
set.seed(123)

# Fit
nn_model <- nnet::nnet(
  price ~ house_size + bed + bath + acre_lot + state + status,
  data = train_data,
  size = 7,
  decay = 0.10,
  linout = TRUE,
  maxit = 500,
  trace = FALSE
)

# Predict on test set
nn_preds <- predict(nn_model, newdata = test_data, type = "raw")

# Metrics
nn_rmse <- sqrt(mean((test_data$price - nn_preds)^2, na.rm = TRUE))
nn_mae <- mean(abs(test_data$price - nn_preds), na.rm = TRUE)
nn_r2 <- 1 - sum((test_data$price - nn_preds)^2, na.rm = TRUE) /
  sum((test_data$price - mean(test_data$price, na.rm = TRUE))^2, na.rm = TRUE)

knitr::kable(tibble::tibble(
  model = "nnet",
  rmse = nn_rmse,
  mae = nn_mae,
  r2 = nn_r2
))
```

The observed vs. predicted plot reveals that the neural network produces nearly constant predictions across the entire price range. This indicates that the model failed to learn meaningful relationships and effectively underfits the data. The most likely reason is the lack of feature scaling combined with strong regularization, which prevents the network from adjusting its weights appropriately. As a result, the neural network defaults to predicting values close to the overall mean price.

### Interpretation of Neural Network

```{r}
#| code-fold: true
#| warning: false
#| message: false
# Quick interpretation helpers

# 1) Observed vs. predicted (scatter)
plot(test_data$price, nn_preds,
     xlab = "Observed price", ylab = "Predicted price")
abline(0, 1)

# 2) Permutation importance (simple and model-agnostic)
perm_rmse <- function(df, var) {
  df_perm <- df
  df_perm[[var]] <- sample(df_perm[[var]])
  p <- predict(nn_model, newdata = df_perm, type = "raw")
  sqrt(mean((test_data$price - p)^2, na.rm = TRUE))
}

vars <- c("house_size", "bed", "bath", "acre_lot", "state", "status")
base_rmse <- nn_rmse

nn_perm_imp <- purrr::map_dbl(vars, ~ perm_rmse(test_data, .x) - base_rmse) |>
  tibble::tibble(variable = vars, rmse_increase = _) |>
  dplyr::arrange(dplyr::desc(rmse_increase))

knitr::kable(nn_perm_imp)
```

The best tuned neural network configuration (size = 7, decay = 0.10) achieves a test RMSE of roughly 1,100,793, the highest error among the three models. Despite its theoretical flexibility, the neural network does not generalize as well on this tabular dataset. This suggests that, in this case, increased model complexity does not automatically translate into better performance.

## Comparison of the Models

In terms of RMSE, the ranking is:

Random Forest (≈ 842k) < Linear Regression (≈ 1,019k) < Neural Network (≈ 1,101k).

The Random Forest reduces prediction error by roughly 17% compared to the linear regression and by about 24% compared to the neural network, a clear performance advantage.

# Conclusion

The results show that model flexibility matters, but the type of flexibility is crucial. Tree-based ensemble methods like Random Forest handle structured, interaction-heavy data best, leading to the lowest prediction error. Linear regression serves as a reasonable baseline, while the neural network does not provide benefits in this setting.